Emmanuel, Adomas, Subhayan and Chuang
Hidradenitis suppurativa (HS) - inflammatory skin disease linked to immune dysregulation and abnormalities in follicular structure and function.
Data - Lesional and non-lesional skin samples from 20 patients
Study objective - Find an expression-based disease signature of HS by observing affected and unaffected skin samples.
Methods - Differential expression and pathway enrichment analyses.
The dataset used for the study isn’t particularly diverse - for instance, only 3 males were sampled, all of whom were either active or ex-smokers.
We have two different files from the Gene Expression Omnibus (GEO), one containing the metadata and the other the raw gene counts from the RNAseq.
# Tidy columns (with no prefixes)
meta_table <- meta_table |>
mutate(
Tissue = gsub("tissue: ", "", Tissue),
Tissue_type = gsub("tissue_type: ", "", Tissue_type),
Patient_id = gsub("patient_id: ", "", Patient_id),
Age = gsub("age: ", "", Age),
Gender = gsub("gender: ", "", Gender),
Race = gsub("race: ", "", Race),
Smoker= gsub("smoker: ", "", Smoker),
BMI= gsub("bmi: ", "", BMI),
Sample = str_split_i(Sample, pattern="_", 2),
)Linear regression on a binary value (perilesion or lesion) we want to see which gene expression varies the most.
Many collagens in the most upregulated genes
We can see on a volcano plot the correlation between expression change and significance
A small relative change in expression implies that the gene is not significant (Bonferonni correction)
We can see on a volcano plot the correlation between expression change and significance
A small relative change in expression implies that the gene is not significant (Bonferonni correction)
We can see on a volcano plot the correlation between expression change and significance
A small relative change in expression implies that the gene is not significant (Bonferonni correction)
We want to retrieve the name of the genes we are analyzing (instead of the Ensemble names)
We use the database org.HS.eg.db with the package AnnotationDbi (to manipulate the database)
# Find the corresponding gene SYMBOL for all ENSEMBL in org.HS.eg.db
mapping <- AnnotationDbi::select(org.Hs.eg.db,
keys = keys(org.Hs.eg.db, keytype = "ENSEMBL"),
keytype = "ENSEMBL",
columns = c("SYMBOL", "GENENAME", "CHR")
) |>
as_data_frame()
# Use left_join from tidyverse
annotated_data <- expression_data |>
left_join(mapping, by = c('gene' = 'ENSEMBL'))
no_annotation_genes <- annotated_data |>
filter(is.na(SYMBOL))
# Antijoin keep all element that is not in the no_annotation_genes
annotated_data <- annotated_data |>
anti_join(no_annotation_genes,
by = 'gene')
# Genes with multiple annotation would have more than 20 entrees
duplicates <- annotated_data |>
group_by(gene) |>
filter(n() > 20)
mult_annotated_genes <- duplicates |>
dplyr::select(gene, SYMBOL, GENENAME) |>
distinct() |>
ungroup()
annotated_data <- annotated_data |>
anti_join(mult_annotated_genes, by = 'gene') Some samples have more non-lesional reads, while some have more lesional reads.
Interesting - No downregulated genes in the Y chromosome